Overview: Script for cleaning dissolved oxygen data from water quality stations; sources and notes on the data are below. First I remove data points flagged as failed or suspect. Then I remove data points more than +/- 2 standard deviations from a 2 hr rolling mean. Finally I calculate the hourly median, since the sites logged at different frequencies and a common hourly time step sets me up for analysis between sites and with field data. Sites are graphed from closest to the delta to the mouth of the bay.
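The three cleaning steps above can be sketched end-to-end in one pipe. This is a schematic only, on simulated data with hypothetical column names `do` and `flag`; the per-site code below is what actually runs.

```r
library(dplyr)
library(zoo)

# Toy stand-in for one station's raw download; `do` and `flag` are
# hypothetical column names, not the actual station columns used below.
set.seed(42)
raw <- data.frame(
  datetime = seq(as.POSIXct("2017-01-01", tz = "UTC"), by = "15 min", length.out = 96),
  do       = rnorm(96, mean = 8, sd = 0.5),
  flag     = "<0>",
  stringsAsFactors = FALSE
)

clean <- raw %>%
  filter(!grepl("-3|1", flag)) %>%                   # 1. drop failed/suspect QC flags
  mutate(m = rollapplyr(do, 8, mean, fill = NA),     # 2. 2 hr (8 obs) rolling mean...
         s = rollapplyr(do, 8, sd,   fill = NA)) %>%
  filter(do < m + 2 * s, do > m - 2 * s) %>%         #    ...drop points beyond +/- 2 sd
  mutate(hour = as.POSIXct(format(datetime, "%Y-%m-%d %H:00"), tz = "UTC")) %>%
  group_by(hour) %>%                                 # 3. hourly median
  summarise(do = median(do), .groups = "drop")
```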

Notes on data:

- Accessed and downloaded all data on September 30, 2020.
- Downloaded all data available from 1/1/2017-12/31/2019.
- China Camp: http://cdmo.baruch.sc.edu/aqs/zips.cfm. Data logged every 15 minutes. Requested citation format: NOAA National Estuarine Research Reserve System (NERRS). System-wide Monitoring Program. Data accessed from the NOAA NERRS Centralized Data Management Office website: http://cdmo.baruch.sc.edu/; accessed 30 September 2020.
- Richardson Bay: NERR data. Data logged every 15 minutes.
- EOS: https://oceanview.pfeg.noaa.gov/erddap/tabledap/rtcctdRTCysi.html. Data logged every 6 minutes. Time in UTC. Should I use this link instead? http://erddap.cencoos.org/erddap/tabledap/tiburon-water-tibc1.html. pH and salinity available there; no air temperature.
- Fort Point: http://erddap.cencoos.org/erddap/tabledap/fort-point.html?time%2Csea_water_temperature%2Csea_water_temperature_qc_agg%2Cz&time%3E%3D2020-09-20T21%3A07%3A41Z&time%3C%3D2020-09-29T12%3A00%3A00Z. Time in UTC. Data logged every 6 minutes.

Next steps:

- Fix the aesthetics of the graphs: font too small, change gradient, clean up keys, etc.
- Analysis with field data
- Site characterization summary table

Set up

rm(list=ls())

library(tidyverse)
library(ggpubr)
library(scales)
library(chron)
library(plotly)
library(taRifx)
library(aweek)
library(easypackages)
library(renv)
library(here)
library(ggthemes)
library(gridExtra)
library(patchwork)
library(tidyquant)
library(recipes) 
library(cranlogs)
library(knitr)
library(openair)

Make a custom function that calculates the mean, standard deviation (sd), and a ~95% interval, where x is a numeric vector. Here “hi” and “lo” are defined as 2 standard deviations from the mean. This step only creates the function and does not manipulate the data, so nothing here needs to change; you can edit the equations to define hi and lo as 3 (or however many) sds away instead. na.rm=TRUE removes any “NA” values. “ret” holds the new columns that get added when you run this on a df.

custom_stat_fun_2<-function(x, na.rm=TRUE){
  m<-mean(x, na.rm=TRUE)   # window mean
  s<-sd(x, na.rm=TRUE)     # window standard deviation
  hi<-m+2*s                # upper bound: mean + 2 sd
  lo<-m-2*s                # lower bound: mean - 2 sd
  ret<-c(mean=m, stdev=s, hi.95=hi, lo.95=lo)
  ret                      # return the four values explicitly
}
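A quick standalone sanity check of what the function returns; the helper is duplicated here so this chunk runs on its own, and the input values are arbitrary illustration data.

```r
# Copy of custom_stat_fun_2 above, so this chunk is self-contained.
custom_stat_fun_2 <- function(x, na.rm = TRUE){
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  ret <- c(mean = m, stdev = s, hi.95 = m + 2*s, lo.95 = m - 2*s)
  ret
}

res <- custom_stat_fun_2(c(4, 5, 6, 5, 4))
round(res, 2)
#  mean stdev hi.95 lo.95
#  4.80  0.84  6.47  3.13
```

The hi.95/lo.95 values sit symmetrically 2 sds above and below the mean; those are the bounds used for filtering below.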

China Camp
Read in data. Subset (not needed, but it’s a large df). We need a column named “date” with no time stamp for a later step.

cc<-read.csv("C:/Users/chels/Box Sync/Thesis/Data/Working data/Bouy data/china_camp_all.csv", header = TRUE, sep=",", fileEncoding="UTF-8-BOM", stringsAsFactors = FALSE)

cc$date<-as.POSIXct(cc$DateTimeStamp, format = c("%m/%d/%Y"))

cc$datetime<-as.POSIXct(cc$DateTimeStamp, format = c("%m/%d/%Y %H:%M"))

cc<-subset(cc, select=c(date, datetime, DO_mgl, F_DO_mgl))

cc%>%
  ggplot(aes(x=datetime, y=DO_mgl, color=DO_mgl))+
  geom_point(alpha=0.5)+
  labs(title="Dissolved oxygen data from China Camp: All data points",
       subtitle="01/01/2017 - 12/31/2019",
       caption= "data courtesy of NERR")
## Warning: Removed 847 rows containing missing values (geom_point).

Remove data that failed QC test

cc<-cc[- grep("-3", cc$F_DO_mgl),]
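One caveat with this pattern worth noting: `grep()` returns `integer(0)` when nothing matches, and `df[-integer(0), ]` keeps zero rows, so a site with no failed flags would be silently emptied. A safer equivalent (just a sketch on toy data, not a change to the pipeline) uses logical indexing with `grepl()`:

```r
# Toy flags: nothing matches "-3"
df <- data.frame(val = c(1, 2, 3), flag = c("<0>", "<0>", "<0>"))

nrow(df[-grep("-3", df$flag), ])   # 0 -- all rows silently dropped
nrow(df[!grepl("-3", df$flag), ])  # 3 -- rows kept, as intended
```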

cc%>%
  ggplot(aes(x=date, y=DO_mgl, color=DO_mgl))+
  geom_point(alpha=0.5)+
  labs(title="Dissolved oxygen data from China Camp: Failed QC points removed",
       subtitle="01/01/2017 - 12/31/2019",
       caption= "data courtesy of NERR")
## Warning: Removed 835 rows containing missing values (geom_point).

Remove data points flagged as “suspect” too

cc<-cc[- grep("1", cc$F_DO_mgl),]

cc%>%
  ggplot(aes(x=date, y=DO_mgl, color=DO_mgl))+
  geom_point(alpha=0.5)+
  labs(title="Dissolved oxygen data from China Camp: Failed & suspect QC points removed",
       subtitle="01/01/2017 - 12/31/2019",
       caption= "data courtesy of NERR")
## Warning: Removed 835 rows containing missing values (geom_point).

Still a lot of noise.

Roll apply using the custom stat function. It needs a column named “date” with no time stamp in order to work. Since data were collected every 15 min, 8 observations = a 2 hour window.

2 hour

cc<-cc%>%
  tq_mutate(
    select= DO_mgl,
    mutate_fun= rollapply,
    width= 8,
    align= "right",
    by.column= FALSE,
    FUN= custom_stat_fun_2,
    na.rm= TRUE)
## Warning in zoo(y, order.by = index(x), ...): some methods for "zoo" objects do
## not work if the index entries in 'order.by' are not unique

Filter to remove values that are more than +/- 2 standard deviations from the rolling mean

cc<-filter(cc, DO_mgl <hi.95 & DO_mgl >lo.95)

cc%>%
  ggplot(aes(x=datetime, y=DO_mgl, color=DO_mgl))+
  geom_point(alpha=0.5)+
  labs(title="Dissolved oxygen data from China Camp: Failed & suspect QC points and +/-2sd outliers removed",
       subtitle="01/01/2017 - 12/31/2019",
       caption= "data courtesy of NERR")
## Warning: Removed 9 rows containing missing values (geom_point).

Removed some of the noise, but not as much as I’d like. Should I consider a different window length than 2 hrs?
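One way to explore that question without committing to a change: recompute the rolling bounds at a few candidate widths and compare how many points each would flag. A sketch on simulated 15 minute data (8 obs = 2 hr, 16 = 4 hr, 32 = 8 hr); the series and spike locations are made up purely to illustrate the mechanics, so the counts themselves mean nothing.

```r
library(zoo)

set.seed(7)
# Simulated tidal-looking DO series with a few injected spikes
do <- 8 + sin(seq(0, 12 * pi, length.out = 600)) + rnorm(600, sd = 0.2)
do[c(100, 300, 500)] <- do[c(100, 300, 500)] + 4

# Count points falling outside mean +/- 2 sd for a given rolling width
n_flagged <- function(x, width) {
  m <- rollapplyr(x, width, mean, fill = NA)
  s <- rollapplyr(x, width, sd,   fill = NA)
  sum(x > m + 2 * s | x < m - 2 * s, na.rm = TRUE)
}

res <- sapply(c(8, 16, 32), function(w) n_flagged(do, w))
res  # flag counts per window width
```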

Hourly median
For the timeAverage function to work, you need a date or POSIXct column named “date”, but it needs the time stamp to work, so we’ll add it back in. interval = the frequency of data collection.

cc<-subset(cc, select=-c(date))
cc$date<-as.POSIXct(cc$datetime, format = c("%m/%d/%Y %H:%M"))

cc.1hr.med<- timeAverage(cc,
            avg.time="1 hour",
            data.thresh = 0,
            statistic = "median",
            interval = "15 min",
            remove.calm = FALSE,
            type = "default")
## Warning: Missing dates detected, removing 9 lines
## Warning in checkPrep(mydata, vars, type = "default", remove.calm = FALSE, :
## Detected data with Daylight Saving Time, converting to UTC/GMT
cc.1hr.med<-subset(cc.1hr.med, select=(-c(mean, stdev, hi.95, lo.95)))
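The DST warning above is a side effect of parsing the timestamps without an explicit time zone: R assumes the machine’s local zone (which has DST jumps) and openair then shifts everything to UTC. If the logger actually records in a fixed offset, passing tz= at the as.POSIXct step avoids the ambiguity. A sketch; “Etc/GMT+8” (fixed PST) is an assumption for illustration, not a statement about what these stations record.

```r
# Same clock time parsed in a fixed-offset zone vs. a DST-observing zone:
t_fixed <- as.POSIXct("07/01/2018 12:00", format = "%m/%d/%Y %H:%M",
                      tz = "Etc/GMT+8")            # always UTC-8
t_local <- as.POSIXct("07/01/2018 12:00", format = "%m/%d/%Y %H:%M",
                      tz = "America/Los_Angeles")  # UTC-7 in July (PDT)

difftime(t_fixed, t_local, units = "hours")        # the two differ by 1 hour
```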

cc_do<-cc.1hr.med %>% ggplot(aes(x=datetime, y=DO_mgl, color=DO_mgl))+
  geom_point(alpha=0.5)+ylim(0,25)+
  labs(title="Hourly median dissolved oxygen from China Camp",
       subtitle="01/01/2017 - 12/31/2019",
       caption= "data courtesy of NERR")

cc_do
## Warning: Removed 1857 rows containing missing values (geom_point).

Still seems like a lot of noise to me. Maybe try a larger window on the rolling mean?

Format and save

cc.1hr.med$date<-as.Date(cc.1hr.med$datetime, format = c("%Y-%m-%d"))

names(cc.1hr.med)[names(cc.1hr.med) == "DO_mgl"] <- "do"

write.csv(cc.1hr.med, "C:/Users/chels/Box Sync/Thesis/Data/Working data/Bouy data/cc_do.csv")

EOS pier

Set up

eos<-read.csv("C:/Users/chels/Box Sync/Thesis/Data/Working data/Bouy data/eos_bouy_r.csv", header = TRUE, sep=",", fileEncoding="UTF-8-BOM", stringsAsFactors = FALSE)

eos$date<-as.Date(eos$date, format = c("%m/%d/%Y"))
eos$datetime<-as.POSIXct(eos$datetime, format = c("%Y-%m-%dT%H:%M:%SZ"))

eos<-subset(eos, select=c(date, datetime, odo))

eos%>%
  ggplot(aes(x=datetime, y=odo, color=odo))+
  geom_point(alpha=0.5)+
  labs(title="Dissolved oxygen data from EOS research pier: All data",
       subtitle="01/01/2017 - 12/31/2019",
       caption= "data courtesy of CeNCOOS")

Plot with better view

eos%>%
  ggplot(aes(x=datetime, y=odo, color=odo))+
  geom_point(alpha=0.5)+ylim(0,15)+
  labs(title="Dissolved oxygen data from EOS research pier: All data",
       subtitle="01/01/2017 - 12/31/2019",
       caption= "data courtesy of CeNCOOS")
## Warning: Removed 1 rows containing missing values (geom_point).

That low value chunk at the end of 2018 looks suspicious.

Since this data set doesn’t have a QC flag column, I’ll remove values that fall far outside the expected range

eos<-filter(eos, odo >0)

eos%>%
  ggplot(aes(x=datetime, y=odo, color=odo))+
  geom_point(alpha=0.5)+
  labs(title="Dissolved oxygen data from EOS research pier: Values outside of expected range removed",
       subtitle="01/01/2017 - 12/31/2019",
       caption= "data courtesy of CeNCOOS")

The lower values at the end of 2018 correspond with the low salinity event I see in this data set too, so I need to verify them.

Roll apply using the custom stat function. Data were collected every 6 minutes, so 20 observations = 2 hours. It needs a column named “date” with no time stamp in order to work.

2hr

eos<-eos%>%
  tq_mutate(
    select= odo,
    mutate_fun= rollapply,
    width= 20,
    align= "right",
    by.column= FALSE,
    FUN= custom_stat_fun_2,
    na.rm= TRUE)
## Warning in zoo(y, order.by = index(x), ...): some methods for "zoo" objects do
## not work if the index entries in 'order.by' are not unique

Filter and graph

eos<-filter(eos, odo <hi.95 & odo >lo.95)


ggplot(data = eos, mapping = aes(x = datetime, y = odo, color=odo)) + geom_point()+
  labs(title="Dissolved oxygen data from EOS research pier: Extreme values and +/-2sd outliers removed",
       subtitle="01/01/2017 - 12/31/2019",
       caption= "data courtesy of CeNCOOS")

Again, I need to verify those lower values at the end of 2018.

Hourly median

eos<-subset(eos, select=-c(date))
names(eos)[names(eos) == "datetime"] <- "date"
eos$date<-as.POSIXct(eos$date, format = c("%Y-%m-%dT%H:%M:%SZ"))

eos.1hr.med<- timeAverage(eos,
            avg.time="1 hour",
            data.thresh = 0,
            statistic = "median",
            interval = "6 min",
            remove.calm = FALSE,
            type = "default")
## Warning in checkPrep(mydata, vars, type = "default", remove.calm = FALSE, :
## Detected data with Daylight Saving Time, converting to UTC/GMT
eos.1hr.med<-subset(eos.1hr.med, select=(-c(mean, stdev, hi.95, lo.95)))

eos_do<-eos.1hr.med %>% ggplot(aes(x=date, y=odo, color=odo))+
  geom_point(alpha=0.5)+ylim(0,25)+
  labs(title="Hourly median dissolved oxygen from EOS research pier",
       subtitle="01/01/2017 - 12/31/2019",
       caption= "data courtesy of CeNCOOS")
eos_do
## Warning: Removed 231 rows containing missing values (geom_point).

Looks good except for that chunk of low values at the end of 2018.

Format and save

names(eos.1hr.med)[names(eos.1hr.med) == "date"] <- "datetime"
eos.1hr.med$date<-as.Date(eos.1hr.med$datetime, format = c("%Y-%m-%d"))
write.csv(eos.1hr.med, "C:/Users/chels/Box Sync/Thesis/Data/Working data/Bouy data/eos_do.csv")

Richardson Bay

Set up

rb<-read.csv("C:/Users/chels/Box Sync/Thesis/Data/Working data/Bouy data/richardson_bay_all.csv", header = TRUE, sep=",", fileEncoding="UTF-8-BOM", stringsAsFactors = FALSE)

rb$date<-as.Date(rb$DateTimeStamp, format = c("%m/%d/%y"))
rb$datetime<-as.POSIXct(rb$DateTimeStamp, format = c("%m/%d/%y %H:%M"))

rb<-subset(rb, select=c(date, datetime, DO_mgl, F_DO_mgl, DateTimeStamp))

rb%>%
  ggplot(aes(x=datetime, y=DO_mgl, color=DO_mgl))+
  geom_point(alpha=0.5)+
  labs(title="Dissolved oxygen data from Richardson Bay: All data",
       subtitle="01/01/2017 - 12/31/2019",
       caption= "data courtesy of NERR")
## Warning: Removed 1508 rows containing missing values (geom_point).

Remove flagged data that did not pass QAQC

rb<-rb[- grep("-3", rb$F_DO_mgl),]


rb%>%
  ggplot(aes(x=datetime, y=DO_mgl, color=DO_mgl))+
  geom_point(alpha=0.5)+
  labs(title="Dissolved oxygen data from Richardson Bay: Failed QC data removed",
       subtitle="01/01/2017 - 12/31/2019",
       caption= "data courtesy of NERR")
## Warning: Removed 1508 rows containing missing values (geom_point).

Remove data points flagged as “suspect”

rb<-rb[- grep("1", rb$F_DO_mgl),]

rb%>%
  ggplot(aes(x=datetime, y=DO_mgl, color=DO_mgl))+
  geom_point(alpha=0.5)+
  labs(title="Dissolved oxygen data from Richardson Bay: Failed & suspect QC data removed",
       subtitle="01/01/2017 - 12/31/2019",
       caption= "data courtesy of NERR")
## Warning: Removed 1508 rows containing missing values (geom_point).

Roll apply using the custom stat function. Data were collected every 15 min, so a 2 hr width = 8 observations.

rb<-rb%>%
  tq_mutate(
    select= DO_mgl,
    mutate_fun= rollapply,
    width= 8,
    align= "right",
    by.column= FALSE,
    FUN= custom_stat_fun_2,
    na.rm= TRUE)
## Warning in zoo(y, order.by = index(x), ...): some methods for "zoo" objects do
## not work if the index entries in 'order.by' are not unique

Filter

rb<-filter(rb, DO_mgl <hi.95 & DO_mgl >lo.95)

rb%>%
  ggplot(aes(x=datetime, y=DO_mgl, color=DO_mgl))+
  geom_point(alpha=0.5)+
  labs(title="Dissolved oxygen data from Richardson Bay: Failed & suspect QC data and +/-2sd outliers removed",
       subtitle="01/01/2017 - 12/31/2019",
       caption= "data courtesy of NERR")
## Warning: Removed 7 rows containing missing values (geom_point).

Hourly median

rb<-subset(rb, select=-c(date))
rb$date<-as.POSIXct(rb$datetime, format = c("%m/%d/%y %H:%M"))


rb.1hr.med<- timeAverage(rb,
            avg.time="1 hour",
            data.thresh = 0,
            statistic = "median",
            interval = "15 min",
            remove.calm = FALSE,
            type = "default")
## Warning: Missing dates detected, removing 7 lines
## Warning in checkPrep(mydata, vars, type = "default", remove.calm = FALSE, :
## Detected data with Daylight Saving Time, converting to UTC/GMT
rb_do<-rb.1hr.med%>%
  ggplot(aes(x=date, y=DO_mgl, color=DO_mgl))+
  geom_point(alpha=0.5)+ylim(0,25)+
  labs(title="Hourly median dissolved oxygen data from Richardson Bay",
       subtitle="01/01/2017 - 12/31/2019",
       caption= "data courtesy of NERR")
rb_do
## Warning: Removed 7 rows containing missing values (geom_point).

I think this looks good.

Format and save

rb.1hr.med<-subset(rb.1hr.med, select=(-c(mean, stdev, hi.95, lo.95)))
names(rb.1hr.med)[names(rb.1hr.med) == "DO_mgl"] <- "do"
rb.1hr.med$date<-as.Date(rb.1hr.med$date, format = c("%Y-%m-%d"))
write.csv(rb.1hr.med, "C:/Users/chels/Box Sync/Thesis/Data/Working data/Bouy data/rb_do.csv")

Fort Point

No dissolved oxygen data at this station.

Graph together

all_do<-ggarrange(cc_do, eos_do, rb_do, nrow=3, ncol=1)
## Warning: Removed 1857 rows containing missing values (geom_point).
## Warning: Removed 231 rows containing missing values (geom_point).
## Warning: Removed 7 rows containing missing values (geom_point).
all_do